Fundamental Techniques in Data Science with R

Today

This lecture

  • Data manipulation

  • for-loops, if statements

  • custom functions

  • Basic analysis (correlation & t-test)

  • Pipes

  • Data visualization with ggplot2

The absolute basics

We assign elements to objects in R like this:

a <- 100
b <- a^2

We can combine elements into multiple types of objects:

  • vectors: a combination of elements into 1 dimension
  • matrices: a combination of vectors into 2 dimensions
  • arrays: a combination of matrices into more than 2 dimensions

Vectors, matrices and arrays can hold only a single data type, e.g. numeric or character. Objects that can combine numeric and character information are:

  • dataframes: a combination of equal length vectors into 2 dimensions
  • lists: a combination of whatever you want
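A quick illustration of this (not from the slides' own code): because a vector can hold only one type, mixing numbers and text coerces everything to character, while a data frame keeps a separate type per column.

```r
# Mixing numeric and character in one vector: everything becomes character
x <- c(1, "a", TRUE)
class(x)          # "character"

# A data frame keeps each column's own type
d <- data.frame(num = 1:2, chr = c("a", "b"))
sapply(d, class)  # num is integer; chr is character (factor in R < 4.0)
```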

Vectors

c(1, 3, 5, 2, 6, 4)
## [1] 1 3 5 2 6 4
v <- 1:6
v
## [1] 1 2 3 4 5 6
v[4]
## [1] 4

Matrices

m <- matrix(1:6, nrow = 3, ncol = 2)
m
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
m[1, ]
## [1] 1 4
m[, 2]
## [1] 4 5 6

Data frames

d <- data.frame(numbers = 1:3, 
                letters = c("a", "b", "c"),
                V3 = c(TRUE, FALSE, TRUE))
d
##   numbers letters    V3
## 1       1       a  TRUE
## 2       2       b FALSE
## 3       3       c  TRUE
d$numbers
## [1] 1 2 3
d$letters
## [1] a b c
## Levels: a b c

Lists

l <- list(vector = v, dframe = d)
l
## $vector
## [1] 1 2 3 4 5 6
## 
## $dframe
##   numbers letters    V3
## 1       1       a  TRUE
## 2       2       b FALSE
## 3       3       c  TRUE
l$vector
## [1] 1 2 3 4 5 6
l$dframe$numbers
## [1] 1 2 3

Some programming tips:

  • keep your code tidy
  • use comments (text preceded by #) to clarify what you are doing
    • If you look at your code again a month from now, you will not remember what you did, unless you use comments
  • when working with functions, use the TAB key to quickly access the help for the function’s components
  • work with logically named R-scripts
    • indicate the sequential nature of your work
  • work with RStudio projects
  • if allowed, place your project folders in some cloud-based environment

Functions

Functions have parentheses (). A name directly followed by parentheses always indicates a function. For example:

  • matrix() is a function
  • c() is a function
  • but (1 - 2) * 5 is a calculation, not a function

Packages

Packages give additional functionality to R.

By default, some packages are included. These allow you to do mainstream statistical analyses and data manipulation. Installing additional packages allows you to use state-of-the-art methods in statistical programming and estimation.

The cool thing is that these packages are all developed by users. New developments therefore reach you quickly:

  • newly developed functions and software are readily available
  • this is different from other mainstream software, like SPSS, where new methodology may take years to be implemented.

A list of available packages can be found on CRAN

Loading packages

Packages extend the basic functionality of R.

There are two ways to load a package in R:

library(stats)

and

require(stats)
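The two are not identical when the package is missing: library() stops with an error, while require() returns FALSE with a warning, so it can be tested. A small sketch:

```r
# library(notInstalledPkg)  # would stop with an error if the package is absent

# require() returns TRUE/FALSE, so it can be used in a condition:
if (require(stats)) {
  message("stats is loaded")
} else {
  message("stats is not available")
}
```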

Installing packages

The easiest way to install a package, e.g. mice, is to use

install.packages("mice")

Alternatively, you can also do it in RStudio through

Tools --> Install Packages

R in depth

Workspaces and why you should sometimes save them

A workspace contains all the objects you have created in your R session.

A saved workspace contains everything as it was at the moment you saved it.

You do not need to run all the previous code again if you would like to continue working at a later time.

  • You can save the workspace and continue exactly where you left off.

Workspaces are compressed and require relatively little storage space. The compression is very efficient and beats reloading large datasets.
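A minimal sketch of saving and restoring a workspace (the file name here is just an example):

```r
a <- 100
b <- a^2

# Save all objects in the current workspace to a file
save.image(file = "my_workspace.RData")

# In a later session: restore all objects exactly as they were
load("my_workspace.RData")
```

save() can store a selection of objects instead of the whole workspace, e.g. save(a, b, file = "some_objects.RData").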

History and why it is useful

R by default saves (part of) the code history and RStudio expands this functionality greatly.

It is often useful to look back at the code history, for various reasons.

  • There are multiple ways to access the code history.

    1. Use the up arrow in the console. This allows you to go back in time, one line of code at a time. Extremely useful for going back to previous lines to make minor alterations to the code.
    2. Use the history tab in the environment pane. The complete project history can be found here and the history can be searched. This is particularly convenient when you know what code you are looking for.

Working in projects in RStudio

  • Every project has its own history
  • Every research project has its own project
  • Every project can have its own folder, which also serves as a research archive
  • Every project can have its own version control system
  • R-studio projects can relate to Git (or other online) repositories

In general…

  • Use common sense and BE CONSISTENT.

  • Browse through the tidyverse style guide

    • The point of having style guidelines is to have a common vocabulary of coding
    • so people can concentrate on what you are saying, rather than on how you are saying it.
  • If code you add to a file looks drastically different from the existing code around it, the discontinuity will throw readers and collaborators out of their rhythm when they go to read it. Try to avoid this.

  • Intentional spacing makes your code easier to interpret

    • a<-c(1,2,3,4,5) vs;
    • a <- c(1, 2, 3, 4, 5)
  • at least put a space after every comma!

Packages we use in these slides

library(MASS)     # for the cats data
library(dplyr)    # data manipulation
library(haven)    # in/exporting data
library(magrittr) # pipes
library(mice)     # for the boys data
library(ggplot2)  # visualization

Key functions

  • transform(): changing and adding columns
  • dplyr::filter(): row-wise selection (of cases)
  • table(): frequency tables
  • class(): object class
  • levels(): levels of a factor
  • haven::read_sav(): import SPSS data
  • cor(): bivariate correlation
  • sample(): drawing a sample
  • t.test(): t-test
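As a quick illustration of one of these, sample() draws a random sample; set.seed() makes the draw reproducible:

```r
set.seed(123)                                  # make the random draw reproducible
sample(1:10, size = 3)                         # 3 values, without replacement
sample(c("H", "T"), size = 5, replace = TRUE)  # 5 coin flips, with replacement
```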

Data manipulation

The cats data

head(cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6
str(cats)
## 'data.frame':    144 obs. of  3 variables:
##  $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
##  $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

How to get only Female cats?

fem.cats <- cats[cats$Sex == "F", ]
dim(fem.cats)
## [1] 47  3
head(fem.cats)
##   Sex Bwt Hwt
## 1   F 2.0 7.0
## 2   F 2.0 7.4
## 3   F 2.0 9.5
## 4   F 2.1 7.2
## 5   F 2.1 7.3
## 6   F 2.1 7.6

How to get only heavy cats?

heavy.cats <- cats[cats$Bwt > 3, ]
dim(heavy.cats)
## [1] 36  3
head(heavy.cats)
##     Sex Bwt  Hwt
## 109   M 3.1  9.9
## 110   M 3.1 11.5
## 111   M 3.1 12.1
## 112   M 3.1 12.5
## 113   M 3.1 13.0
## 114   M 3.1 14.3

How to get only heavy cats?

heavy.cats <- subset(cats, Bwt > 3)
dim(heavy.cats)
## [1] 36  3
head(heavy.cats)
##     Sex Bwt  Hwt
## 109   M 3.1  9.9
## 110   M 3.1 11.5
## 111   M 3.1 12.1
## 112   M 3.1 12.5
## 113   M 3.1 13.0
## 114   M 3.1 14.3

Another way: dplyr

filter(cats, Bwt > 2, Bwt < 2.2, Sex == "F")
##   Sex Bwt Hwt
## 1   F 2.1 7.2
## 2   F 2.1 7.3
## 3   F 2.1 7.6
## 4   F 2.1 8.1
## 5   F 2.1 8.2
## 6   F 2.1 8.3
## 7   F 2.1 8.5
## 8   F 2.1 8.7
## 9   F 2.1 9.8

Working with factors

class(cats$Sex)
## [1] "factor"
levels(cats$Sex)
## [1] "F" "M"

Working with factors

levels(cats$Sex) <- c("Female", "Male")
table(cats$Sex)
## 
## Female   Male 
##     47     97
head(cats)
##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4 Female 2.1 7.2
## 5 Female 2.1 7.3
## 6 Female 2.1 7.6

Sorting

sort1 <- arrange(cats, Bwt)
head(sort1)
##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4   Male 2.0 6.5
## 5   Male 2.0 6.5
## 6 Female 2.1 7.2
sort2 <- arrange(cats, desc(Bwt))
head(sort2)
##    Sex Bwt  Hwt
## 1 Male 3.9 14.4
## 2 Male 3.9 20.5
## 3 Male 3.8 14.8
## 4 Male 3.8 16.8
## 5 Male 3.7 11.0
## 6 Male 3.6 11.8

Combining matrices or dataframes

cats.numbers <- cbind(Weight = cats$Bwt, HeartWeight = cats$Hwt)
head(cats.numbers)
##      Weight HeartWeight
## [1,]    2.0         7.0
## [2,]    2.0         7.4
## [3,]    2.0         9.5
## [4,]    2.1         7.2
## [5,]    2.1         7.3
## [6,]    2.1         7.6

Combining matrices or dataframes

rbind(cats[1:3, ], cats[1:5, ])
##      Sex Bwt Hwt
## 1 Female 2.0 7.0
## 2 Female 2.0 7.4
## 3 Female 2.0 9.5
## 4 Female 2.0 7.0
## 5 Female 2.0 7.4
## 6 Female 2.0 9.5
## 7 Female 2.1 7.2
## 8 Female 2.1 7.3

Basic analysis

Correlation

cor(cats[, -1])
##           Bwt       Hwt
## Bwt 1.0000000 0.8041274
## Hwt 0.8041274 1.0000000

With [, -1] we exclude the first column

Correlation

cor.test(cats$Bwt, cats$Hwt)
## 
##  Pearson's product-moment correlation
## 
## data:  cats$Bwt and cats$Hwt
## t = 16.119, df = 142, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7375682 0.8552122
## sample estimates:
##       cor 
## 0.8041274

What do we conclude?

Correlation

plot(cats$Bwt, cats$Hwt)

T-test

Test the null hypothesis that the difference in mean heart weight between male and female cats is 0

t.test(formula = Hwt ~ Sex, data = cats)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group Female   mean in group Male 
##             9.202128            11.322680

T-test

plot(formula = Hwt ~ Sex, data = cats)

Controls and flows

if | for | apply | functions

Automation

  • If-statements

  • For-loops

  • Apply()

  • Writing your own functions

New controls and functions

New control flow constructs

  • if(cond) expr
  • if(cond) cons.expr else alt.expr
  • for(var in seq) expr

New functions

  • rev(): returns a reversed version of its argument
  • apply(): apply a function to margins of a matrix
  • sapply(): apply a function to the elements of a list, vector or matrix; returns a vector or matrix
  • lapply(): apply a function to the elements of a list; returns a list
  • print(): print an object to the console
  • cat(): outputs an object with less conversion than print()
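Two of these in action (a small illustration): rev() reverses a vector, and cat() performs less conversion than print(), e.g. it interprets escape sequences such as \n.

```r
rev(1:5)        # 5 4 3 2 1

print("a\nb")   # shows the string as-is: [1] "a\nb"
cat("a\nb\n")   # interprets \n, printing a and b on separate lines
```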

Conditionals and loops

If-statements

Often, we want to run some code only if some condition is true.

For example:

a <- 2
a > 5
## [1] FALSE
if (a > 5){
  print("a is larger than 5.")
}
a <- 8
if (a > 5){
  print("a is larger than 5.")
}
## [1] "a is larger than 5."

If-else-statements

We can also specify something to be run if the condition is not true.

a <- 2
if (a > 5){
  print("a is larger than 5.")
} else {
  print("a is smaller than 5.")
}
## [1] "a is smaller than 5."

If-else-statements

a <- 8
if (a > 5){
  print("a is larger than 5.")
} else {
  print("a is smaller than 5.")
}
## [1] "a is larger than 5."

For-loops

For loops are used when we want to perform some repetitive calculations.

It is often tedious, or even impossible, to write this repetition out completely.

For-loops

# Let's print the numbers 1 to 6 one by one. 
print(1)
## [1] 1
print(2)
## [1] 2
print(3)
## [1] 3
print(4)
## [1] 4
print(5)
## [1] 5
print(6)
## [1] 6

For-loops

For-loops allow us to automate this!

for (i in 1:6){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6

For-loops

for (i in 1:6){
  print(i < 5)
}
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] FALSE
## [1] FALSE

For-loops

for (i in 1:nrow(cats)){
  if (cats$Bwt[i] > 2.5){
    cat(i, "is over 2.5. It is:", cats$Bwt[i], "\n")
  }
}
## 37 is over 2.5. It is: 2.6 
## 38 is over 2.5. It is: 2.6 
## 39 is over 2.5. It is: 2.6 
## 40 is over 2.5. It is: 2.7 
## 41 is over 2.5. It is: 2.7 
## 42 is over 2.5. It is: 2.7 
## 43 is over 2.5. It is: 2.9 
## 44 is over 2.5. It is: 2.9 
## 45 is over 2.5. It is: 2.9 
## 46 is over 2.5. It is: 3 
## 47 is over 2.5. It is: 3 
## 73 is over 2.5. It is: 2.6 
## 74 is over 2.5. It is: 2.6 
## 75 is over 2.5. It is: 2.6 
## 76 is over 2.5. It is: 2.6 
## 77 is over 2.5. It is: 2.6 
## 78 is over 2.5. It is: 2.6 
## 79 is over 2.5. It is: 2.7 
## 80 is over 2.5. It is: 2.7 
## 81 is over 2.5. It is: 2.7 
## 82 is over 2.5. It is: 2.7 
## 83 is over 2.5. It is: 2.7 
## 84 is over 2.5. It is: 2.7 
## 85 is over 2.5. It is: 2.7 
## 86 is over 2.5. It is: 2.7 
## 87 is over 2.5. It is: 2.7 
## 88 is over 2.5. It is: 2.8 
## 89 is over 2.5. It is: 2.8 
## 90 is over 2.5. It is: 2.8 
## 91 is over 2.5. It is: 2.8 
## 92 is over 2.5. It is: 2.8 
## 93 is over 2.5. It is: 2.8 
## 94 is over 2.5. It is: 2.8 
## 95 is over 2.5. It is: 2.9 
## 96 is over 2.5. It is: 2.9 
## 97 is over 2.5. It is: 2.9 
## 98 is over 2.5. It is: 2.9 
## 99 is over 2.5. It is: 2.9 
## 100 is over 2.5. It is: 3 
## 101 is over 2.5. It is: 3 
## 102 is over 2.5. It is: 3 
## 103 is over 2.5. It is: 3 
## 104 is over 2.5. It is: 3 
## 105 is over 2.5. It is: 3 
## 106 is over 2.5. It is: 3 
## 107 is over 2.5. It is: 3 
## 108 is over 2.5. It is: 3 
## 109 is over 2.5. It is: 3.1 
## 110 is over 2.5. It is: 3.1 
## 111 is over 2.5. It is: 3.1 
## 112 is over 2.5. It is: 3.1 
## 113 is over 2.5. It is: 3.1 
## 114 is over 2.5. It is: 3.1 
## 115 is over 2.5. It is: 3.2 
## 116 is over 2.5. It is: 3.2 
## 117 is over 2.5. It is: 3.2 
## 118 is over 2.5. It is: 3.2 
## 119 is over 2.5. It is: 3.2 
## 120 is over 2.5. It is: 3.2 
## 121 is over 2.5. It is: 3.3 
## 122 is over 2.5. It is: 3.3 
## 123 is over 2.5. It is: 3.3 
## 124 is over 2.5. It is: 3.3 
## 125 is over 2.5. It is: 3.3 
## 126 is over 2.5. It is: 3.4 
## 127 is over 2.5. It is: 3.4 
## 128 is over 2.5. It is: 3.4 
## 129 is over 2.5. It is: 3.4 
## 130 is over 2.5. It is: 3.4 
## 131 is over 2.5. It is: 3.5 
## 132 is over 2.5. It is: 3.5 
## 133 is over 2.5. It is: 3.5 
## 134 is over 2.5. It is: 3.5 
## 135 is over 2.5. It is: 3.5 
## 136 is over 2.5. It is: 3.6 
## 137 is over 2.5. It is: 3.6 
## 138 is over 2.5. It is: 3.6 
## 139 is over 2.5. It is: 3.6 
## 140 is over 2.5. It is: 3.7 
## 141 is over 2.5. It is: 3.8 
## 142 is over 2.5. It is: 3.8 
## 143 is over 2.5. It is: 3.9 
## 144 is over 2.5. It is: 3.9

The apply() family

apply()

The apply family is a group of very useful functions that allow you to easily execute a function of your choice over the elements of a list, or the rows or columns of a data.frame or matrix.

We will look at three examples:

  • apply

  • sapply

  • lapply

apply()

apply() is used for matrices (and sometimes data frames). It takes a function that accepts a vector as input and applies it to each row or each column.

apply()

MARGIN is 1 for rows, 2 for columns.

apply(cats[, -1], MARGIN = 2, mean)
##       Bwt       Hwt 
##  2.723611 10.630556

But we have seen an easier way to do this:

colMeans(cats[, -1])
##       Bwt       Hwt 
##  2.723611 10.630556

However, the power of apply() is that it can use any function we throw at it.

apply()

rand.mat <- matrix(rnorm(21), nrow = 3, ncol = 7)
rand.mat
##            [,1]        [,2]       [,3]        [,4]       [,5]       [,6]
## [1,]  0.1594086 -0.03658331  1.0583128 -0.28745910 -0.6437308 0.22441955
## [2,] -1.0715401  1.99749051 -0.4039843  0.65321112  1.8323755 0.64415613
## [3,] -0.1565307 -0.02856474  1.6409261 -0.07785777  1.2787334 0.04713398
##            [,7]
## [1,]  0.6833758
## [2,] -0.2313327
## [3,] -2.4995756
apply(rand.mat, MARGIN = 1, FUN = max)
## [1] 1.058313 1.997491 1.640926
apply(rand.mat, MARGIN = 2, FUN = max)
## [1] 0.1594086 1.9974905 1.6409261 0.6532111 1.8323755 0.6441561 0.6833758

apply()

rand.mat
##            [,1]        [,2]       [,3]        [,4]       [,5]       [,6]
## [1,]  0.1594086 -0.03658331  1.0583128 -0.28745910 -0.6437308 0.22441955
## [2,] -1.0715401  1.99749051 -0.4039843  0.65321112  1.8323755 0.64415613
## [3,] -0.1565307 -0.02856474  1.6409261 -0.07785777  1.2787334 0.04713398
##            [,7]
## [1,]  0.6833758
## [2,] -0.2313327
## [3,] -2.4995756
apply(rand.mat, MARGIN = 1, FUN = sum)
## [1] 1.1577436 3.4203762 0.2042647
apply(rand.mat, MARGIN = 2, FUN = var)
## [1] 0.4087157 1.3737367 1.1099016 0.2438758 1.6889151 0.0940074 2.6854662

sapply()

sapply() is used on list-objects and simplifies the result into a vector or matrix where possible

my.list <- list(A = c(4, 2, 1:3), B = "Hello.", C = TRUE)
sapply(my.list, class)
##           A           B           C 
##   "numeric" "character"   "logical"
sapply(my.list, range)
##      A   B        C  
## [1,] "1" "Hello." "1"
## [2,] "4" "Hello." "1"

It returns a vector or a matrix, depending on the output of the function.

Why is each element a character string?
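The answer: range() gives a numeric result for A, a character result for B and a logical result for C. sapply() then binds these results into one matrix, and a matrix can hold only a single type, so everything is coerced to the most general type present, character. For the individual elements:

```r
range(c(4, 2, 1:3))  # numeric:   1 4
range("Hello.")      # character: "Hello." "Hello."
range(TRUE)          # logical
```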

sapply()

Any data.frame is also a list, where each column is one list-element.

class(cats)
## [1] "data.frame"
is.list(cats)
## [1] TRUE

This means we can use sapply on data frames as well, which is often useful.

sapply(cats, class)
##       Sex       Bwt       Hwt 
##  "factor" "numeric" "numeric"

lapply()

lapply() is exactly the same as sapply(), but it returns a list instead of a simplified vector or matrix.

lapply(cats, class)
## $Sex
## [1] "factor"
## 
## $Bwt
## [1] "numeric"
## 
## $Hwt
## [1] "numeric"

Writing your own functions

What are functions?

Functions are reusable pieces of code that take an input, do some computation on the input, and return output.

We have been using a lot of functions: code of the form something() is usually a function.

mean(1:6)
## [1] 3.5

Our own function

The apply family of functions is very flexible and fast compared with writing the same operations out manually.

The only caveat is that you need a function to apply. Many such functions are already available in R, such as mean(), median(), sum(), cor(), and so on.

However, if you need to perform more than a simple calculation, it is often necessary to create your own function. In R, functions take the following form:

myfunction <- function(arguments){
  hereyourfunctioncode
}

A function example

mean.sd <- function(argument1, argument2){
  mean1 <- mean(argument1) 
  mean2 <- mean(argument2)
  sd1 <- sd(argument1)
  sd2 <- sd(argument2)
  result <- data.frame(mean = c(mean1, mean2),
                       sd = c(sd1, sd2), 
                       row.names = c("first", "second"))
  return(result)
}

The above function calculates the means and standard deviations for two sources of input, then combines these statistics in a simple data frame and returns the data frame.

The sources of input are defined in the function arguments argument1 and argument2.

What happens in a function…

The reason why we have to specify function arguments is simple:

\[\text{EVERYTHING THAT HAPPENS IN A FUNCTION COMES FROM THE FUNCTION}\]
\[\text{AND STAYS IN THE FUNCTION!}\]

This is because a function opens a separate environment that only exists for as long as the function runs. This means:

To get information from the global environment into the function’s environment, we use arguments. To return information to the global environment, we use return(). In general, return() makes your function’s output explicit. For complicated functions this is proper coding practice; for simple functions it is not strictly necessary.
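A small illustration of this scoping behaviour:

```r
x <- 10

add_one <- function(x){
  x <- x + 1     # changes only the local copy inside the function
  return(x)
}

add_one(x)  # 11
x           # still 10: the global x is untouched
```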

Our example function

To put this example function to the test:

mean.sd(argument1 = 1:10,
        argument2 = 3:8)
##        mean       sd
## first   5.5 3.027650
## second  5.5 1.870829

or, simply:

mean.sd(1:10, 3:8)
##        mean       sd
## first   5.5 3.027650
## second  5.5 1.870829

Pipes

This is a pipe:

boys <- 
  read_sav("boys.sav") %>%
  head()

It effectively replaces head(read_sav("boys.sav")).

Why are pipes useful?

Let’s assume that we want to load data, change a variable, filter cases and select columns. Without a pipe, this would look like:

boys  <- read_sav("boys.sav")
boys2 <- transform(boys, hgt = hgt / 100)
boys3 <- filter(boys2, age > 15)
boys4 <- subset(boys3, select = c(hgt, wgt, bmi))

With the pipe:

boys <-
  read_sav("boys.sav") %>%
  transform(hgt = hgt/100) %>%
  filter(age > 15) %>%
  subset(select = c(hgt, wgt, bmi))

Benefit: a single object in memory that is easy to interpret

With pipes

Your code becomes more readable:

  • data operations are structured from left to right, not from the inside out
  • nested function calls are avoided
  • local variables and copied objects are avoided
  • easy to add steps in the sequence

What do pipes do?

  • f(x) becomes x %>% f()
rnorm(10) %>% mean()
## [1] 0.05984628
  • f(x, y) becomes x %>% f(y)
boys %>% cor(use = "pairwise.complete.obs")
##           hgt       wgt       bmi
## hgt 1.0000000 0.6100784 0.1758781
## wgt 0.6100784 1.0000000 0.8841304
## bmi 0.1758781 0.8841304 1.0000000
  • h(g(f(x))) becomes x %>% f %>% g %>% h
boys %>% subset(select = wgt) %>% na.omit() %>% max()
## [1] 117.4

More pipe stuff

The standard %>% pipe

The %>% pipe inserts the left-hand side as the first argument of the function on the right-hand side.

The %$% pipe

The %$% pipe exposes the names inside the left-hand side object to the expression on the right-hand side.

The role of . in a pipe

In a %>% b(arg1, arg2, arg3), a will become arg1. With . we can change this.

cats %>%
  plot(Hwt ~ Bwt, data = .)

VS

cats %$%
  plot(Hwt ~ Bwt)

The . can be used as a placeholder in the pipe.
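Because using . suppresses the default first-argument insertion, the left-hand side can go anywhere in the call. For example (assuming magrittr is loaded):

```r
library(magrittr)

# Default: the left-hand side becomes the first argument
10 %>% seq_len()          # same as seq_len(10)

# With . : the left-hand side goes where we put it
10 %>% seq(1, ., by = 3)  # same as seq(1, 10, by = 3): 1 4 7 10
```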

Performing a t-test in a pipe

cats %$%
  t.test(Hwt ~ Sex)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group Female   mean in group Male 
##             9.202128            11.322680

is the same as

t.test(Hwt ~ Sex, data = cats)

Storing a t-test from a pipe

cats.test <- cats %$%
  t.test(Bwt ~ Sex)

cats.test
## 
##  Welch Two Sample t-test
## 
## data:  Bwt by Sex
## t = -8.7095, df = 136.84, p-value = 8.831e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6631268 -0.4177242
## sample estimates:
## mean in group Female   mean in group Male 
##             2.359574             2.900000

Data visualization with ggplot2

The anscombe data

anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
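These four x-y pairs are famous because their summary statistics are nearly identical, while their scatterplots look completely different. A quick check of the correlations:

```r
# Correlation of each x-y pair in the anscombe data
sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
# all four values are approximately 0.816
```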

Fitting a line

anscombe %>%
  ggplot(aes(x1, y1)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Fitting a line

Why visualise?

  • We can process a lot of information quickly with our eyes
  • Plots give us information about
    • Distribution / shape
    • Irregularities
    • Assumptions
    • Intuitions
  • Summary statistics, correlations, parameters, model tests, p-values do not tell the whole story

ALWAYS plot your data!

Why visualise?

Source: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.

Why visualise?

What is ggplot2?

Layered plotting based on the book The Grammar of Graphics by Leland Wilkinson.

With ggplot2 you

  1. provide the data
  2. define how to map variables to aesthetics
  3. state which geometric object to display
  4. (optional) edit the overall theme of the plot

ggplot2 then takes care of the details

An example: scatterplot

1: Provide the data

mice::boys %>%
  ggplot()

2: map variables to aesthetics

mice::boys %>%
  ggplot(aes(x = age, y = bmi))

3: state which geometric object to display

mice::boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point()

An example: scatterplot

Why this syntax?

Create the plot

gg <- 
  mice::boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point(col = "dark green")

Add another layer (smooth fit line)

gg <- gg + 
  geom_smooth(col = "dark blue")

Give it some labels and a nice look

gg <- gg + 
  labs(x = "Age", y = "BMI", title = "BMI trend for boys") +
  theme_minimal()

Why this syntax?

plot(gg)

Why this syntax?